Skip to content

[dylink] Fix deadlock in _emscripten_dlsync_threads()#27000

Merged
sbc100 merged 19 commits into
emscripten-core:mainfrom
kleisauke:issue-26913-fix
May 28, 2026
Merged

[dylink] Fix deadlock in _emscripten_dlsync_threads()#27000
sbc100 merged 19 commits into
emscripten-core:mainfrom
kleisauke:issue-26913-fix

Conversation

@kleisauke
Copy link
Copy Markdown
Collaborator

@kleisauke kleisauke commented May 23, 2026

This extends commit 662cb06 by ensuring _emscripten_thread_notify()
is also called for the synchronous _emscripten_dlsync_threads()
path. This ensures that threads process the dlopen catch-up queue
promptly, even when blocked in emscripten_futex_wait().

Fixes: #26913.

Comment thread system/lib/pthread/proxying.c Outdated
DBG("waking main runtime thread using _emscripten_thread_notify");
if (ret) {
// Wake up the target thread in case it is blocked in
// `emscripten_futex_wait`.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is little confusing (and maybe wasteful) though since normal pthread do not run their proxy queue in emscripten_yield, and therefore wakeing them them up should not be needed.

pthreads do however call _emscripten_process_dlopen_queue from emscripten_yield .. so it should only be necessary to wake them up if/when need need to call _emscripten_process_dlopen_queue, and not whereever work gets proxied to them.

Copy link
Copy Markdown
Collaborator Author

@kleisauke kleisauke May 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if _emscripten_thread_notify() is actually that expensive, since it already returns early when the thread isn't waiting.

I tried limiting this notify to the dlopen() proxying queue with commit 72cb72a, but reintroducing && pthread_equal(target_thread, emscripten_main_runtime_thread_id()) regressed this again.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pthreads do however call _emscripten_process_dlopen_queue from emscripten_yield [...]

I think that's probably why reintroducing && pthread_equal(target_thread, emscripten_main_runtime_thread_id()) caused this regression again, as the queue is processed from pthreads rather than from the main runtime thread.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really worried about the code of doing the _emscripten_thread_notify. But I would really like to know why it is needed. Because pthreads don't run their messge queue on wakeup I don't see any reason to wake them when we put something in the queue.

pthreads should only need to wakeup if they are need to process the dlopen queue, right?

Copy link
Copy Markdown
Collaborator Author

@kleisauke kleisauke May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pthreads should only need to wakeup if they are need to process the dlopen queue, right?

Correct, the latest revision of this PR already restricts wakeups to the dlopen queue.

sbc100 pushed a commit that referenced this pull request May 23, 2026
kleisauke added 3 commits May 23, 2026 22:39
This extends commit 662cb06 by ensuring `_emscripten_thread_notify()`
is also called for the synchronous `_emscripten_dlsync_threads()`
path. This ensures that threads process the dlopen catch-up queue
promptly, even when blocked in `emscripten_futex_wait()`.

To implement this cleanly, `dlopen_proxying_queue` is moved from a
dynamic pointer in `dynlink.c` to a statically initialized queue in
`proxying.c`.

Fixes: emscripten-core#26913.
This is an automatic change generated by tools/maint/rebaseline_tests.py.

The following (2) test expectation files were updated by
running the tests with `--rebaseline`:

```
codesize/test_codesize_minimal_pthreads.json: 26180 => 26179 [-1 bytes / -0.00%]
codesize/test_codesize_minimal_pthreads_memgrowth.json: 26589 => 26588 [-1 bytes / -0.00%]

Average change: -0.00% (-0.00% - -0.00%)
```
@kleisauke kleisauke changed the title [dylink] Fix deadlock in synchronous _emscripten_dlsync_threads() path [dylink] Fix deadlock in _emscripten_dlsync_threads() May 23, 2026
kleisauke added 7 commits May 26, 2026 10:56
This is an automatic change generated by tools/maint/rebaseline_tests.py.

The following (2) test expectation files were updated by
running the tests with `--rebaseline`:

```
codesize/test_codesize_minimal_pthreads.json: 26158 => 26157 [-1 bytes / -0.00%]
codesize/test_codesize_minimal_pthreads_memgrowth.json: 26567 => 26566 [-1 bytes / -0.00%]

Average change: -0.00% (-0.00% - -0.00%)
```
@sbc100
Copy link
Copy Markdown
Collaborator

sbc100 commented May 27, 2026

I'd really like to get this fixed for the 6.0.0 release, but I'd also love to understand the issue better rather then just always waking pthreads when they we add a message to their queue (which still seem like it should not be needed in theory).

kleisauke added 2 commits May 28, 2026 12:31
This is an automatic change generated by tools/maint/rebaseline_tests.py.

The following (2) test expectation files were updated by
running the tests with `--rebaseline`:

```
codesize/test_codesize_minimal_pthreads.json: 26147 => 26146 [-1 bytes / -0.00%]
codesize/test_codesize_minimal_pthreads_memgrowth.json: 26556 => 26555 [-1 bytes / -0.00%]

Average change: -0.00% (-0.00% - -0.00%)
```
@kleisauke
Copy link
Copy Markdown
Collaborator Author

kleisauke commented May 28, 2026

[...] just always waking pthreads when they we add a message to their queue (which still seem like it should not be needed in theory).

The latest revision of this PR already limits the _emscripten_thread_notify() call to the dlopen queue (and the PR description has been updated to reflect this change).

Comment thread test/test_core.py Outdated
self.set_setting('EXIT_RUNTIME')
self.set_setting('PROXY_TO_PTHREAD')
# Uncomment to test _emscripten_proxy_dlsync_async()
# self.set_setting('PROXY_TO_PTHREAD')
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we run this in both modes using @parameterized ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done via 4ef62da.

Copy link
Copy Markdown
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for all the work on this.

This solution looks correct to me. I'm a little sad that the core proxying mechanism needs to be aware of the dynlink queue, but I'm not sure I see any way around it.

Comment thread system/lib/libc/dynlink.c
// `_emscripten_proxy_dlsync` below, and processed by background threads
// that call `_emscripten_process_dlopen_queue` during futex_wait (i.e. whenever
// they block).
static em_proxying_queue * _Atomic dlopen_proxying_queue = NULL;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we leave the declaration of this queue here in this file? (and declare a getter for it?)

In fact, could we just declare the queue itself as extern in proxying.c‎ (avoiding the getting completely).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised this to use an extern declaration with commit 4ef62da.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this seems to regress code size, see e.g. commit 4d7842b.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean moving back to this file causes a regression? I think co-locating it here does make the most sense unless its a big regression, which would seem very odd.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, the code size expectations for the main branch need to be rebaselined.
https://github.com/emscripten-core/emscripten/actions/runs/26584327538/job/78325943846?pr=27000

Copy link
Copy Markdown
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments

Comment thread system/lib/pthread/proxying.c Outdated

// Proxying via the dlopen or system queue may target a thread that is
// currently blocked in `emscripten_futex_wait`, so explicitly wake it
// after enqueueing the task.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you leave the original sentence in place and then add a second one: "In addition, we have a special case here for the dlopen_proxying_queue...". Maybe with a TODO do find way to avoid the need for this special case?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised this comment with 4ef62da.

kleisauke added 2 commits May 28, 2026 17:22
This is an automatic change generated by tools/maint/rebaseline_tests.py.

The following (2) test expectation files were updated by
running the tests with `--rebaseline`:

```
codesize/test_codesize_minimal_pthreads.json: 26146 => 26179 [+33 bytes / +0.13%]
codesize/test_codesize_minimal_pthreads_memgrowth.json: 26555 => 26588 [+33 bytes / +0.12%]

Average change: +0.13% (+0.12% - +0.13%)
```
Copy link
Copy Markdown
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but can you move the declaration of the queue back to ‎system/lib/libc/dynlink.c?

@kleisauke
Copy link
Copy Markdown
Collaborator Author

but can you move the declaration of the queue back to ‎system/lib/libc/dynlink.c?

Unfortunately, this won't work because em_proxying_queue is only forward-declared and therefore an incomplete type.

../../../system/lib/libc/dynlink.c:362:19: error: variable has incomplete type 'em_proxying_queue' (aka 'struct em_proxying_queue')
  362 | em_proxying_queue _dlopen_proxying_queue = {
      |                   ^
/home/kleisauke/emscripten/cache/sysroot/include/emscripten/proxying.h:29:16: note: forward declaration of 'struct em_proxying_queue'
   29 | typedef struct em_proxying_queue em_proxying_queue;
      |                ^
1 error generated.

@sbc100
Copy link
Copy Markdown
Collaborator

sbc100 commented May 28, 2026

but can you move the declaration of the queue back to ‎system/lib/libc/dynlink.c?

Unfortunately, this won't work because em_proxying_queue is only forward-declared and therefore an incomplete type.

../../../system/lib/libc/dynlink.c:362:19: error: variable has incomplete type 'em_proxying_queue' (aka 'struct em_proxying_queue')
  362 | em_proxying_queue _dlopen_proxying_queue = {
      |                   ^
/home/kleisauke/emscripten/cache/sysroot/include/emscripten/proxying.h:29:16: note: forward declaration of 'struct em_proxying_queue'
   29 | typedef struct em_proxying_queue em_proxying_queue;
      |                ^
1 error generated.

Hmm. thats annoying. I guess we would need some kind of way to statically initialize a proxy queue.

Its not great that the dynlink system leaks into the core prozying system like this though, so I'm not sure the advantage of the static initializer outweigh the disadvantage of the separation of concerns here.

I'm hoping to remove all reference to dlopen queue from the core proxying.c (as a followup perhaps) so maybe just keeping this as lazily initialized in dynlink.c is more condusive to that future?

@sbc100
Copy link
Copy Markdown
Collaborator

sbc100 commented May 28, 2026

Maybe you can use pthread_once to initialize it in its getter function?

@sbc100
Copy link
Copy Markdown
Collaborator

sbc100 commented May 28, 2026

Or maybe C11 call_once?

@kleisauke
Copy link
Copy Markdown
Collaborator Author

Actually, we can trivially move it back to dylink.c. See commit e178d8e. I've also updated the PR description accordingly.

kleisauke added 2 commits May 28, 2026 18:45
Since accessing a shared variable from multiple threads without
synchronization (where at least one access is a write) is always
a data race and causes UB.
@sbc100
Copy link
Copy Markdown
Collaborator

sbc100 commented May 28, 2026

Sorry the codesize tests just got rebased. Can you rebase one more time?

@sbc100
Copy link
Copy Markdown
Collaborator

sbc100 commented May 28, 2026

Just to confirm, with the changes to test_pthread_dlopen, it will now fail without this fix? Is it just one variant of it in particular requires this fix?

@sbc100 sbc100 enabled auto-merge (squash) May 28, 2026 19:11
@kleisauke
Copy link
Copy Markdown
Collaborator Author

Just to confirm, with the changes to test_pthread_dlopen, it will now fail without this fix?

I can confirm that browser.test_pthread_dlopen (as well as core2.test_pthread_dlopen) will timeout/fail without this fix, whereas core2.test_pthread_dlopen_proxied is unaffected.

Is it just one variant of it in particular requires this fix?

Yup, this fix was only required for the synchronous path in _emscripten_dlsync_threads(). I guess there was previously missing test coverage for this (which is also why PR #27001 was needed).

@sbc100 sbc100 disabled auto-merge May 28, 2026 19:36
@sbc100 sbc100 merged commit 0d68ae8 into emscripten-core:main May 28, 2026
37 of 39 checks passed
@kleisauke kleisauke deleted the issue-26913-fix branch May 28, 2026 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Possible deadlock in _emscripten_dlsync_threads() since #26659

2 participants